Variables can be mapped in R using readily available CRAN packages such as rgdal and maps. The focus of this tutorial will be on mapping polygon vector (aka areal) data and not on raster (aka field) data.
maps packageMapping values requires two datasets: the values to be mapped and the spatial boundaries of the areas being mapped. The maps package offers R-ready maps for US states/counties and administrative boundaries for other countries such as Canada, France and Italy.
In the following example we will use ggplot to draw the US counties map. We will also color the counties based on the state they are located in. The state name column is labeled region and the county name column is labeled subregion.
library(ggplot2)
library(maps)
# Load the county data from the maps package
cnty <- map_data("county")
# Plot the counties
ggplot(cnty, aes(long,lat, group = group)) +
geom_polygon(aes(fill = region), colour = rgb(1,1,1,0.2)) +
theme(legend.position="none") +
coord_quickmap()The x and y values are assigned the long and lat columns that store the coordinate values for each vertex that delineates a county. The group parameter tells ggplot which polygon (county) a vertex is defining. The geom_polygon() function generates filled areas from those vertices. The colour parameter defines the polygon outline color. The option legend.position="none" prevents the legend from being displayed and the coord_quickmap() function forces the aspect ratio to approximate that of the Mercator projection.
If you want to change the map projection, you can substitute coord_quickmap() with coord_map(). The latter accepts as parameter projections available in the the mapproj package. For example, to project the continental US into a Bonne projection modify the above chunk as follows:
ggplot(cnty, aes(long,lat, group = group)) +
geom_polygon(aes(fill = region), colour = rgb(1,1,1,0.2)) +
theme(legend.position="none") +
coord_map("bonne", param=45) If we want to omit the background grid, we modify the chunk as follows (note that the theme() options gets overridden by the theme_void() option requiring that we pass the show.legend = FALSE option to geom_polygon).
ggplot(cnty, aes(long,lat, group = group)) +
geom_polygon(aes(fill = region), show.legend = FALSE, colour = rgb(1,1,1,0.2)) +
coord_map("bonne", param=45) +
theme_void()The next step is to join a dataset to the counties map. We will use the income-by-gender-education dataset used earlier in the course and limit the data to median income per person (B20004001 column).
library(dplyr)
df <- read.csv("http://mgimond.github.io/ES218/Data/Income_education.csv")
df1 <- df %>% select(subregion=County, region=State, B20004001 )Next, we need to join the census data to the county map using two columns: state name and county name (or region and subregion columns). Let’s compare the two dataframes by viewing the first few rows of each.
head(cnty) long lat group order region subregion
1 -86.50517 32.34920 1 1 alabama autauga
2 -86.53382 32.35493 1 2 alabama autauga
3 -86.54527 32.36639 1 3 alabama autauga
4 -86.55673 32.37785 1 4 alabama autauga
5 -86.57966 32.38357 1 5 alabama autauga
6 -86.59111 32.37785 1 6 alabama autauga
head(df1) subregion region B20004001
1 Autauga al 35881
2 Baldwin al 31439
3 Barbour al 25201
4 Bibb al 29016
5 Blount al 32035
6 Bullock al 26408
We note two problems. first, the cnty object encodes states using their full name and not their two letter abbreviation. Second, the county names in cnty are in all lower case. Before attempting to join the dataframes, we must first fix these discrepancies. We will choose to modify df so that its state and county names match those in the cnty (map) dataframe.
We will first create a state name/abbreviation look-up table using the built-in state.name and state.abb objects. Note that using these pre-existing objects will require that we add the District of Columbia to the table since it’s not present in either data objects.
st <- data.frame(region=tolower(state.name), State = tolower(state.abb))
st <- bind_rows(st, data.frame(region="district of columbia", State="dc") )Next, we join the st look-up table to df then make the additional changes needed to join the census data to the counties map. Note that we are overwriting the earlier instance of df1.
df1 <- df %>%
inner_join(st, by="State") %>%
mutate(subregion = tolower(County)) %>%
select(subregion, region, B20004001 )We can now join df1 to cnty.
cnty.df1 <- inner_join(cnty, df1, by=c("region", "subregion") )Now let’s map the income distribution.
ggplot(cnty.df1, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = rgb(1,1,1,0.2)) +
coord_quickmap()You may notice that the counties map does not seem complete; most notable is the absence of all Louisiana counties. Let’s look at the rows in cnty.df1 associated with the state of Louisiana.
filter(cnty.df1, region == "louisiana")[1] long lat group order region subregion B20004001
<0 rows> (or 0-length row.names)
There are no rows returned. This suggests that the Louisiana counties did not properly join. Let’s compare the Louisiana county names between df1 and cnty.
df1 %>% filter(region == "louisiana") %>% select(subregion) subregion
1 acadia parish
2 allen parish
3 ascension parish
4 assumption parish
5 avoyelles parish
6 beauregard parish
7 bienville parish
8 bossier parish
9 caddo parish
10 calcasieu parish
11 caldwell parish
12 cameron parish
13 catahoula parish
14 claiborne parish
15 concordia parish
16 de soto parish
17 east baton rouge parish
18 east carroll parish
19 east feliciana parish
20 evangeline parish
21 franklin parish
22 grant parish
23 iberia parish
24 iberville parish
25 jackson parish
26 jefferson parish
27 jefferson davis parish
28 lafayette parish
29 lafourche parish
30 lasalle parish
31 lincoln parish
32 livingston parish
33 madison parish
34 morehouse parish
35 natchitoches parish
36 orleans parish
37 ouachita parish
38 plaquemines parish
39 pointe coupee parish
40 rapides parish
41 red river parish
42 richland parish
43 sabine parish
44 st. bernard parish
45 st. charles parish
46 st. helena parish
47 st. james parish
48 st. john the baptist parish
49 st. landry parish
50 st. martin parish
51 st. mary parish
52 st. tammany parish
53 tangipahoa parish
54 tensas parish
55 terrebonne parish
56 union parish
57 vermilion parish
58 vernon parish
59 washington parish
60 webster parish
61 west baton rouge parish
62 west carroll parish
63 west feliciana parish
64 winn parish
cnty %>% filter(region == "louisiana") %>% select(subregion) %>% unique() subregion
1 acadia
45 allen
67 ascension
106 assumption
145 avoyelles
211 beauregard
243 bienville
279 bossier
372 caddo
461 calcasieu
493 caldwell
527 cameron
568 catahoula
666 claiborne
686 concordia
791 de soto
847 east baton rouge
897 east carroll
950 east feliciana
978 evangeline
1021 franklin
1098 grant
1145 iberia
1203 iberville
1258 jackson
1274 jefferson
1340 jefferson davis
1382 lafayette
1430 lafourche
1498 la salle
1547 lincoln
1573 livingston
1619 madison
1687 morehouse
1737 natchitoches
1823 orleans
1888 ouachita
1922 plaquemines
2044 pointe coupee
2118 rapides
2205 red river
2248 richland
2306 sabine
2354 st bernard
2435 st charles
2474 st helena
2493 st james
2515 st john the baptist
2551 st landry
2596 st martin
2681 st mary
2752 st tammany
2797 tangipahoa
2835 tensas
2906 terrebonne
2982 union
3009 vermilion
3070 vernon
3117 washington
3137 webster
3174 west baton rouge
3213 west carroll
3253 west feliciana
3307 winn
It turns out that Louisiana does not divide its administrative areas into counties but parishes instead. The income data (originally downloaded from the Census Bureau) follows through with that designation convention by adding the word “parish” to each of its administrative area names. We therefore need to remove all instances of parish in the subregion names associated with Louisiana. We will therefore need to recreate the df1 object:
library(stringr)
df1 <- df %>%
inner_join(st, by="State") %>%
mutate( subregion = tolower(County) ,
subregion = ifelse(region=="louisiana",
str_replace(subregion, " parish", ""), subregion)) %>%
select(subregion, region, B20004001 )Let’s re-join the dataframes and re-plot the map.
cnty.df1 <- inner_join(cnty, df1, by=c("region", "subregion") )
ggplot(cnty.df1, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = rgb(1,1,1,0.2)) +
coord_quickmap()Most of Louisiana’s parishes are now mapped, but we are still missing a few parishes as well as a few counties. This is a result of differences in spelling for two-word county and parish names. For example, df1 encodes St. Lucie county (Florida) as st. lucie whereas cnty omits the dot and encodes it as st lucie. Fixing these discrepancies will require additional labor. At this point, it may prove more fruitful to do away with state/county names as joining keys and use FIPS county codes instead. FIPS (Federal Information Processing Standards) codes assign each county/state a unique five number designation thus making it easier to join data to spatial features. However, neither our census data table nor the built-in counties map have FIPS codes. We will download another version of the income data (one that has FIPS codes). Note that FIPS codes are normally part of datasets provided by the Census Bureau. The maps package has a dataset called county.fips that can be used to match state/county names in the county map to FIPS codes.
Let’s download the FIPS version of the dataset. Note that we will not need to modify/add columns to the data since we will be joining this table to the county map using the FIPS code.
df <- read.csv("http://mgimond.github.io/ES218/Data/Income_education_with_FIPS.csv")
df1 <- df %>% select(FIPS, B20004001 )Next, we’ll load the county.fips dataset.
# Load the county.fips dataset
data(county.fips)
head(county.fips) fips polyname
1 1001 alabama,autauga
2 1003 alabama,baldwin
3 1005 alabama,barbour
4 1007 alabama,bibb
5 1009 alabama,blount
6 1011 alabama,bullock
Note that the state and county names are aggregated into a single column; this does not match the data structure in cnty where states and counties are stored in separate columns. This will require some minor data prepping.
cnty2 <- cnty %>%
mutate(polyname = paste(region,subregion,sep=",")) %>%
left_join(county.fips, by="polyname")Now that the FIPS codes are part of the county map dataframe, let’s join the income data table to the county map, then map the income values.
cnty2.df1 <- inner_join(cnty2, df1, by=c("fips" = "FIPS") )
ggplot(cnty2.df1, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = rgb(1,1,1,0.2)) +
coord_quickmap()This is an improvement over the last map. But there are still a few counties missing. After a closer scrutiny of the data, it seems that the county.fips table splits some counties into two or more sub-regions. For example, Accomack county (Virginia) is split into two regions (and thus into two different names):
county.fips %>% filter( grepl("^virginia,accomack", polyname) ) fips polyname
1 51001 virginia,accomack:main
2 51001 virginia,accomack:chincoteague
Fixing these discrepancies would require manual intervention and time. Another work around is to use the Census Bureau’s map files instead of maps’s built-in map. We will cover this in the next section.
The Census Bureau maintains its own library of administrative boundary shapefiles (shapefiles are popular map data formats). This may prove to be the best map source to use since its areal units are designed to match the records in the income data table. Loading a shapefile can easily be accomplished using the readOGR() function from the rgdal package. One option is to download the shapefile from the Census Bureau’s website (usually in a zipped format) then unpack it in a local project folder before reading it into the current session with readOGR(). But we can maximize the replicability of the workflow by writing a chunk of code that will download the zipped shapefile into a temporary directory before unzipping it and loading the shapefile into an active R session as shown below.
library(rgdal)
tmp <- tempdir()
link <- "http://www2.census.gov/geo/tiger/GENZ2010/gz_2010_us_050_00_20m.zip"
filename <- basename(link)
download.file(link, filename)
unzip(filename, exdir = tmp )
shapefile <- readOGR( dsn=tmp, layer = tools::file_path_sans_ext(filename) )OGR data source with driver: ESRI Shapefile
Source: "C:\Users\manu\AppData\Local\Temp\RtmpGqRXqM", layer: "gz_2010_us_050_00_20m"
with 3221 features
It has 6 fields
The shapefile is unzipped into a temporary folder, tmp, then read from that same directory and stored in a SpatialPolygonsDataFrame object called shapefile. The chunk tools::file_path_sans_ext(filename) extracts the name of the zip file (minus the extension) which turns out to also be the name of the shapefile. Note that if the zip file was manually unpacked in a project folder then the only command that would have needed to be executed is the readOGR function as in readOGR( dsn="../Data", layer = "gz_2010_us_050_00_20m" ) where the dsn= parameter points to the directory where the shapefile resides. Note too that the shapefile’s extension is omitted from the name passed to the layer= parameter.
The Census shapefile has a field called GEO_ID that includes the county FIPS codes along with other superfluous county ID values. We only need the five digit county FIPS code so we’ll extract these codes into a new column called FIPS. We will also assign a unique ID value to each row of the dataset for use in a subsequent fortify() function call.
shapefile$FIPS <- as.numeric(substr(shapefile$GEO_ID, 10, 14))
shapefile$id <- rownames(shapefile@data)To plot the shapefile in ggplot, we need to convert the SpatialPolygonsDataFrame object to a dataframe using the fortify() function.
cnty2 <- fortify(shapefile, id="id")Next we can append the income table to the new cnty2 object.
cnty2.cs <- cnty2 %>%
inner_join(shapefile@data, by="id") %>%
inner_join(df1, by="FIPS")Now we can plot the income distribution map.
ggplot(cnty2.cs, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = rgb(1,1,1,0.2)) +
coord_quickmap() + xlim(-125, - 67) + ylim( 25,50 )Depending on the size of your output extent, you might find the county outlines distracting–especially in the northeastern region where the county sizes are so small. We can omit the outline by setting the colour parameter to NA.
ggplot(cnty2.cs, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = NA) +
coord_quickmap() + xlim(-125, - 67) + ylim( 25,50 ) ggplot defaults to a continuous color scheme that ranges from a dark blue to a light blue. Continuous color schemes can be modified in ggplot by using the scale_fill_gradient function. For example, to use a light yellow to dark green color scheme modify the above chunk of code as follows:
ggplot(cnty2.cs, aes(long, lat,group = group)) +
geom_polygon(aes(fill = B20004001), colour = NA) +
coord_quickmap() + xlim(-125, - 67) + ylim( 25,50 ) +
scale_fill_gradient(low="lightyellow2", high="darkgreen")To see a list of available color names in R, type colours() at the console.
You can also create a discrete color scheme–i.e. one where the number of colors are fixed by a finite set of income value intervals. For example, to split the income values across six intervals (e.g. $0 to $20000, $20000 to $30000, $30000 to $40000, $40000 to $50000 , $50000 to $60000 and $60000 to $75000) we can use the cut() function to assign each income value in the table to one of the intervals, then add that interval value to the ctny2.cs map data frame object.
interv <- c(0,20000,30000, 40000,50000, 60000, 75000)
cnty2.cs$Income <- cut(cnty2.cs$B20004001, breaks = interv , labels = paste(interv[-1]))The labels= parameter allows us to define the output interval names. Here we assign the interval’s top dollar value to the interval names. Note that we could have created fancier intervals such as $0 to $20k, $20k to $30k, etc…
Next we’ll map the data, but first we need to change the coloring scheme since we are now using a discrete color scheme as opposed to a continuous one. Cynthia Brewer of Penn State designed a website called http://colorbrewer2.org/ that recommends discrete color schemes to use for different map effects. Defining the colors, especially for a sequential color scheme where the hue is kept constant while its lightness level is modified, is not trivial. Fortunately, her color scheme recommendations are available in ggplot via the scale_fill_brewer() function. To see a list of Brewer color schemes, simply type display.brewer.all() in the console window. Note that we will need to load the RColorBrewer package first.
library(RColorBrewer)
display.brewer.all()We’ll choose the YlGn color scheme. In case you did not run the last code chunk, we’ll load the RColorBrewer package once more.
library(RColorBrewer)
ggplot(cnty2.cs, aes(long, lat,group = group)) +
geom_polygon(aes(fill = Income), colour = NA) +
coord_quickmap() + xlim(-125, - 67) + ylim( 25,50 ) +
scale_fill_brewer(palette = "YlGn")We might choose to map the income distribution using a divergent color scheme where we compare the counties to some overall value such as the counties’ median. We can use the quantile() function to break the income data into its percentiles. This will be helpful in generating symmetrical intervals about the median.
inc.qt <- quantile(df1$B20004001, probs = c(0, 0.125, .25, 0.5, .75, .875 , 1 ))We’ll create new income intervals centered on the median of 30767.
cnty2.cs$Income <- cut(cnty2.cs$B20004001, breaks = inc.qt , labels = paste(inc.qt[-1]))A divergent color scheme consists of two different hues whose lightness values converge towards some central value (the median income in our case). We’ll use Brewer’s red-to-green color palette RdYlGn for this next map.
library(RColorBrewer)
ggplot(cnty2.cs, aes(long, lat,group = group)) +
geom_polygon(aes(fill = Income), colour = NA) +
coord_quickmap() + xlim(-125, - 67) + ylim( 25,50 ) +
scale_fill_brewer(palette = "RdYlGn")The divergent map gives us a different handle on the data. Here, we can identify the poorer than average regions such as the southeast and parts of the Mississippi river region as well as the wealthier regions such as the northeast (with the exception of Maine).